THE PROBABLE ERROR OF A MENDELIAN CLASS FREQUENCY¹

DR. RAYMOND PEARL 

1. With the increasing volume of Mendelian experimentation there is
an ever-growing need for adequate and clearly understood tests for the
statistical significance of differences between observed results and
expectation. A number of different methods of making such tests have
been proposed and used by different workers. For example, early in the
discussion of Mendelism, Weldon² made use of the ordinary, and
frequently inadequate, expression for the standard deviation of a
sub-frequency, σ = √(npq). Johannsen³ has also made much use of this
method. It has several defects. In the first place it assumes the
Gaussian distribution of the errors, an assumption not often strictly
warranted, as Pearson⁴ has clearly shown, and in many cases grossly in
error. In the second place it is not even approximately adequate under
certain extreme conditions (frequently realized in Mendelian work) of
class frequency. Harris⁵ has proposed the χ² "goodness of fit" test
for comparing observed and expected Mendelian distributions. There are
several features of this method which greatly limit it for such use.
Among these are (1) its failure to make correct allowance for "tail"
frequencies (it is just this class of very small frequencies which one
most often wants to test in practical Mendelian work), and (2) the
fact that the test takes into account only the magnitude of the error
and not its direction (i. e., whether in excess or defect) in any
particular case. (3) It gives a result not particularly well adapted
to the actual needs of Mendelian research. The χ² test gives a measure
of the goodness of fit of the whole distribution, and only that. Now
besides being interested in that point the Mendelian worker quite as
often wants to know, in addition, something about the probability that
particular classes observed are significantly different from the
expected. To that sort of knowledge the χ² test helps him not at all.
It is an "all or none" sort of method.⁶

¹ Papers from the Biological Laboratory of the Maine Agricultural
Experiment Station No. 108. The author is greatly indebted to his
assistant, Mr. John Rice Miner, for the laborious arithmetic involved
in Section 6 of the paper.

² Weldon, W. F. R., "Mendel's Laws of Alternative Inheritance in
Peas," Biometrika, Vol. I, pp. 228-265, 1902.

³ Johannsen, W., "Elemente der exakten Erblichkeitslehre," Zweite
Ausgabe, Jena, 1913.

⁴ Pearson, K., "On the Influence of Past Experience on Future
Expectation," Phil. Mag., March, 1907, pp. 365-378.

⁵ Harris, J. A., "A Simple Test of the Goodness of Fit of Mendelian
Ratios," Amer. Nat., Vol. XLVI, pp. 741-745, 1912. Cf. also Pearson,
K., and Heron, D., "On Theories of Association," Biometrika, Vol. IX,
pp. 159-315, 1913.

2. It has seemed to the writer that it would be useful to discuss
methods of determining the true probable error of each class frequency
in Mendelian distributions as a supplement to the χ² test, and for use
in cases where it is not applicable. The fundamental theorems have all
been given by Pearson⁷ in a very important paper, which is apparently
almost entirely unknown to biologists. The purpose of the present
paper is first to show the applicability of these theorems to the
problem in hand, and second to point out some matters regarding the
practical use of the method likely to be helpful to biologists with
but little mathematical training who may attempt to use it.

⁶ I have earlier pointed out other objections to the χ² test in
Mendelian work, in particular its total failure to deal with cases
where experiment yields a small frequency on classes where the
expectation is zero, and need not further discuss them here. I have
never thought it necessary to make any rejoinder to Pearson's
characteristically bitter reply to my criticism, nor do I yet. The χ²
test leads to this absurdity: if I perform a Mendelian experiment in
which I get ten thousand million offspring agreeing perfectly with
expectation save for one lone individual (perhaps a mutation, perhaps
a mistake in the record, or what not) which is of a sort not expected,
then Pearson and the χ² test agree that the probability is infinitely
great that the ten thousand million offspring do not follow Mendelian
law!

⁷ Pearson, K., loc. cit.

3. In the paper referred to, Pearson, starting from Bayes' theorem,
shows that the distribution of chances of an event occurring in a
particular way in a second sample from a population from which a first
sample has produced a certain value is given, not by the ordinates of
a normal curve of errors, as is commonly assumed in writings on the
theory of probability, but by the successive terms of a simple
hypergeometrical series. In an earlier paper the same author⁸ had
solved the problem of the momental properties of the hypergeometrical
series. Combining the two results he was able to derive the necessary
equations for the complete solution of the problem of probable errors
in sampling. We may proceed at once to the exposition of these
results, referring the reader for the proofs to the papers of Pearson
cited.

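Let it be supposed that a first sample of n = p + q be drawn from the
population, p denoting the number of times the event dealt with occurs
in the n trials, and q the number of times it fails.

Write

$$\bar{p} = \frac{p}{n}, \qquad \bar{q} = \frac{q}{n},$$

whence of course

$$\bar{p} + \bar{q} = 1.$$

We then have for the chief constants of the error distribution for a
second sample, of magnitude m, drawn from the same population the
following values:

$$\text{Mean⁹} = m\bar{p} + \frac{m}{n+2}\,(\bar{q} - \bar{p}), \tag{i}$$

$$\text{Mode} = \text{the integral portion of } m\bar{p} + \bar{p}, \tag{ii}$$

$$\text{Standard Deviation} = \sqrt{\, m \left(\bar{p} + \frac{\bar{q}-\bar{p}}{n+2}\right) \left(\bar{q} - \frac{\bar{q}-\bar{p}}{n+2}\right) \left(1 + \frac{m-1}{n+3}\right)}. \tag{iii}$$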

⁸ Pearson, K., "On Certain Properties of the Hypergeometrical Series,
and on the Fitting of such Series to Observation Polygons on the
Theory of Chance," Phil. Mag., Feb., 1899, pp. 236-246.

⁹ From origin at the lower range end, or r = 0.

These values are entirely general, and independent of the values of
n, m, p̄ and q̄. Under certain circumstances, as when n is very large
as compared with m, and neither p̄ nor q̄ is very small, (i) and (iii)
are obviously capable of being put in much simpler form and still
giving a sufficiently close approximation to the true result. For
Mendelian work, however, where frequently neither of these conditions
is even approximately realized, it will in general be better to use
the full expression as given above.
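For readers who prefer to let a machine do the arithmetic, the three
constants (i) to (iii) may be sketched in Python as follows. This is
an illustrative sketch only: the function name
`second_sample_constants` and its argument names are assumptions of
this example and do not come from Pearson's paper; p and q are the
counts in the first sample and m is the size of the second.

```python
import math

def second_sample_constants(p, q, m):
    """Mean, mode and standard deviation of the second-sample frequency.

    p, q : successes and failures in the first sample (n = p + q)
    m    : size of the second sample
    """
    n = p + q
    p_bar, q_bar = p / n, q / n
    shift = (q_bar - p_bar) / (n + 2)
    mean = m * p_bar + m * shift                      # equation (i)
    mode = math.floor(m * p_bar + p_bar)              # equation (ii)
    variance = (m * (p_bar + shift) * (q_bar - shift)
                * (1 + (m - 1) / (n + 3)))
    return mean, mode, math.sqrt(variance)            # equation (iii)

# White class of the Blue Andalusian example: 14.5 expected out of 58.
mean, mode, sd = second_sample_constants(p=14.5, q=43.5, m=58)
print(round(mean, 4), mode, round(sd, 4))
```

With the figures of the first worked example (p = 14.5, q = 43.5,
m = 58) the sketch agrees with the mean 14.9833, mode 14 and standard
deviation 4.6364 quoted in section 5.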

The ordinates of the error distribution (the chances of different
occurrences) are given by the successive terms of the hypergeometrical
series

$$
\begin{aligned}
C_r = C_0 \Big[\, 1 &+ \frac{m}{1!}\cdot\frac{p+1}{q+m}
 + \frac{m(m-1)}{2!}\cdot\frac{(p+1)(p+2)}{(q+m)(q+m-1)} \\
 &+ \frac{m(m-1)(m-2)}{3!}\cdot\frac{(p+1)(p+2)(p+3)}{(q+m)(q+m-1)(q+m-2)} \\
 &+ \frac{m(m-1)(m-2)(m-3)}{4!}\cdot\frac{(p+1)(p+2)(p+3)(p+4)}{(q+m)(q+m-1)(q+m-2)(q+m-3)} + \cdots \Big],
\end{aligned}
\tag{iv}
$$

where

$$C_0 = \frac{\Gamma(q+m+1)\,\Gamma(n+2)}{\Gamma(q+1)\,\Gamma(n+m+2)}. \tag{v}$$
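Where a log-gamma routine is at hand, the terms of (iv) need not be
built up by hand. A minimal sketch, assuming Python's standard
`math.lgamma` and a hypothetical helper named `ordinates` (neither
belongs to the original paper), starts from C_0 as given by (v) and
multiplies repeatedly by the ratio of successive terms of the bracket:

```python
from math import lgamma, exp

def ordinates(p, q, m):
    """Chances of r = 0, 1, ..., m occurrences in the second sample.

    Each returned value is a term of the bracket in (iv) multiplied by C_0.
    """
    n = p + q
    # log of C_0 from equation (v)
    log_c0 = (lgamma(q + m + 1) + lgamma(n + 2)
              - lgamma(q + 1) - lgamma(n + m + 2))
    terms = [exp(log_c0)]
    for r in range(m):
        # ratio of term r+1 to term r in the bracket of (iv)
        ratio = (m - r) * (p + r + 1) / ((r + 1) * (q + m - r))
        terms.append(terms[-1] * ratio)
    return terms

# Small illustration: first sample 3 successes and 7 failures, second sample of 5.
small = ordinates(p=3, q=7, m=5)
print(sum(small))  # the chances over the whole range sum to unity (up to rounding)
```

The recurrence avoids forming any large factorial directly. For very
large samples C_0 itself may underflow in floating point, in which
case the whole computation would have to be carried in logarithms;
that refinement is not shown here.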






As we shall presently see, the calculation of the terms in (iv)
becomes a tedious and laborious matter when the number needed is at
all considerable. Under such circumstances, and when in addition m and
n are even moderately large, equation (iv) may be greatly simplified,
without significant loss of accuracy, by the use of Stirling's theorem
(to the bracket) or by Forsyth's approximation for such of the
factorials as are not included in the range of the Pearson¹⁰ tables.
Thus we have, by Stirling's theorem, remembering that r denotes any
term in the series, and writing s = m − r,

$$C_r = C_0\,\frac{(p+r)!}{p!\;r!}\cdot\frac{m^{\,m+\frac12}\,(q+s)^{\,q+s+\frac12}}{s^{\,s+\frac12}\,(q+m)^{\,q+m+\frac12}}. \tag{vi}$$

Using Forsyth's approximation, which is extremely accurate, one gets

$$C_r = C_0\,\frac{(p+r)!}{p!\;r!}\left\{\frac{\left(m^2+m+\frac16\right)^{m+\frac12}\left\{(q+s)^2+(q+s)+\frac16\right\}^{q+s+\frac12}}{\left(s^2+s+\frac16\right)^{s+\frac12}\left\{(q+m)^2+(q+m)+\frac16\right\}^{q+m+\frac12}}\right\}^{\frac12}. \tag{vii}$$

¹⁰ "Tables for Statisticians and Biometricians," edited by Karl
Pearson, Cambridge, 1914.

The gamma terms in (v) will, of course, be calculated by some one of
the well-known approximations (e. g., Stirling's, Pearson's,
Forsyth's) or by interpolation from a table of factorials (Pearl¹¹).
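The relative merits of the two approximations named above are easily
examined numerically. The sketch below takes Stirling's theorem in its
usual first-order form and Forsyth's approximation in the form
x! ≈ √(2π) {√(x² + x + 1/6) / e}^(x + 1/2), and compares each with
Python's `math.lgamma`. The function names are hypothetical and chosen
for this illustration only.

```python
import math

HALF_LOG_2PI = 0.5 * math.log(2 * math.pi)

def log_factorial_stirling(x):
    """ln x! by Stirling: 0.5 ln(2 pi) + (x + 1/2) ln x - x."""
    return HALF_LOG_2PI + (x + 0.5) * math.log(x) - x

def log_factorial_forsyth(x):
    """ln x! by Forsyth: 0.5 ln(2 pi) + (x + 1/2) (0.5 ln(x^2 + x + 1/6) - 1)."""
    return HALF_LOG_2PI + (x + 0.5) * (0.5 * math.log(x * x + x + 1 / 6) - 1)

for x in (10, 100, 1000):
    exact = math.lgamma(x + 1)  # ln x! from the standard library
    print(x,
          exact - log_factorial_stirling(x),
          exact - log_factorial_forsyth(x))
```

For moderate arguments the Forsyth error printed in the last column is
smaller than the Stirling error by several orders of magnitude, which
accords with the description of the approximation as extremely
accurate.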

4. The proposal which I wish to make for the expression of a Mendelian
result is that the expectation be expressed as the quartile limits for
each class frequency in a second sample of the same size as the
observed sample. In using such an expression it must be clearly
understood that it does not measure the goodness of fit of the
distribution as a whole, because it takes no account of correlations
in errors. What it does give, in supplement to the χ² test, is the
limits of probability of each class frequency, taken by itself.

The ordinary expression for a probable error (e. g., P. E. mean =
± .67449 (σ/√n)) gives the quartile limits (i. e., the limits within
which one half the frequency occurs) on the assumption that the
distribution is Gaussian, since in such a distribution of unit area
the quartile limits are .6744898 . . . times the standard deviation on
either side of the mean. But in our present work we are making no
assumption that the error distribution is Gaussian. Consequently we
must determine the quartiles directly from the distribution. In cases
where the number of terms is not too great the ordinates may be
calculated from (iv) or (vi) and summed to find the quartile. In many
cases, however, this would be practically too tedious an operation,
and we may resort to an approximate method. The simplest one is to
take .67449 σ on either side of the median, which is approximately
determined by remembering that the median lies between the mean and
the mode and approximately twice as far from the mode as from the
mean; that is, the median is taken at one third of the distance from
the mean toward the mode. The criterion of whether this method of
fixing the quartile limits may be safely applied will be found in the
value of the skewness, Sk. In practical work this approximate method
will give sufficiently accurate results unless the skewness is very
large, say > 0.6.

¹¹ Pearl, R., "Interpolation as a Means of Approximation to the Gamma
Function for High Values of n," Science, N. S., Vol. XLI, pp. 506-507,
1915; "On the Degree of Exactness of the Gamma Function Necessary in
Curve Fitting," Ibid., Vol. XLII, pp. 833-834, 1915.
We have by definition

$$Sk = \frac{d}{\sigma}. \tag{viii}$$

Hence having calculated the values of mean, mode and σ by (i), (ii)
and (iii) we can readily obtain (viii), since d = mean − mode.
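The approximate procedure just described reduces to a few lines of
arithmetic. The following sketch, in which the function name
`approximate_quartile_limits` is an assumption made for illustration,
takes the mean, mode and standard deviation as already computed and
returns the skewness together with the approximate median and quartile
limits.

```python
PROBABLE_ERROR_FACTOR = 0.67449  # quartile distance of the normal curve, in units of sigma

def approximate_quartile_limits(mean, mode, sd):
    """Skewness (viii), approximate median, and approximate quartile limits."""
    d = mean - mode
    skewness = d / sd                       # equation (viii)
    median = mean - d / 3.0                 # twice as far from the mode as from the mean
    half_width = PROBABLE_ERROR_FACTOR * sd
    return skewness, median, median - half_width, median + half_width

# Constants of the white class in the first example of section 5.
sk, median, lower, upper = approximate_quartile_limits(mean=14.9833, mode=14, sd=4.6364)
print(round(sk, 4), round(median, 3), round(lower, 1), round(upper, 1))
```

A caller would ordinarily inspect the returned skewness first and fall
back on direct summation of the series when it is large (the text
suggests 0.6 as a rough limit).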

We may now pass to the consideration of some numerical examples, by
means of which certain facts can be better brought out than by further
theoretical discussion.

5. As a first and simple example we may take some data, recently
published by F. L. Platt,¹² on the results of mating Blue Andalusian
fowls. On account of the frequency with which the Blue Andalusian case
is cited as a paradigm in Mendelism, coupled with the great dearth in
the literature of exact statistics of actual matings of this breed of
poultry, it seems especially worth while to discuss these statistics
furnished by Mr. Platt, on the authority of Mr. W. J. Coates, a
breeder of Andalusians.

Table I gives the data, and in the last line, the Mendelian
expectation expressed in the form suggested in this paper.

¹² Platt, F. L., "Western Notes and Comment," Reliable Poultry
Journal, Vol. XXIII, p. 665, 1916.

TABLE I

Showing the Result of Mating Blue × Blue in the Blue Andalusian
Breed of Poultry (Coates' Data)

                                               Offspring
                                 White           Blue              Black        Dark Red
Separate matings A, B, C, D, M
  (counts as printed)            4, 4, 3, 3      10, 5, 3, 12, 3   3, 2, 1, 1   1, 3
Total observed                   14              33                7            4
Total observed in categories
  (1) white, (2) blue,
  (3) pigmented not blue         14              33                11
If the true ratio is 1 white :
  2 blue : 1 pigmented not blue
  it is an even chance,
  considering each class by
  itself, that the frequencies
  in a sample of this size will
  fall between                   11.5 and 17.8   25.8 and 32.9     11.5 and 17.8

The occurrence of the "dark reds," which Mr. Coates informs us had a
pattern like a Red Game, is a phenomenon not mentioned in textbook
accounts of Mendelian inheritance in the Blue Andalusian. In the
present connection, however, we can not pursue that point, but will
group together, as in the penultimate line of the table, the blacks
and reds as "pigmented, not blue," and assume that the three classes
should occur in a 1 : 2 : 1 ratio. Do the actual results bear out this
assumption, having regard to the errors of sampling?

Examining the last two lines of the table, it is clear that each
observed class, taken by itself, is by no means an impossible
approximation to what would be demanded by a 1 : 2 : 1 ratio. The
blues and the "pigmented not blues" fall outside the range for which
the probability is ½ but only slightly outside. It would be
practically an even bet, if Blue Andalusians really follow a law of
1 : 2 : 1 segregation when bred together, that any particular sample
of 58 offspring would show in each particular class as great a
deviation as the present sample.¹³

¹³ Always on the assumption, of course, that it is legitimate to lump
the blacks and reds together. There is room for scepticism on that
point, but we are here only concerned with the case as an illustration
of method.

Now we may consider in detail the mode of calculating the figures in
the last line of Table I.
We have, by hypothesis, and from the statistics

$$m = n = 58.$$

A distribution of 58 in a 1 : 2 : 1 ratio is 14.5 : 29 : 14.5. Assume
a first sample of 58 to show exactly this distribution
14.5 : 29 : 14.5, what will be the mean frequency of one of the end
classes, say white, expected in a second sample of 58?

We have

$$\bar{p} = \frac{14.5}{58} = .25, \qquad \bar{q} = .75,$$

whence, by (i),

$$\text{Mean} = 58(.25) + \frac{58}{60}\,(.75 - .25) = 14.9833,$$

and, by (ii),

$$\text{Mode} = 14, \qquad d = .9833.$$

By the approximate method we get

$$\text{Median} = 14.9833 - \frac{.9833}{3} = 14.656 \text{ approximately}.$$

The standard deviation, from (iii), is

$$\sigma = 4.6364,$$

and, by (viii),

$$Sk = .2121.$$

Actual tests with curves of a degree of skewness no 
greater than this show that the approximate method gives 
the quartile limits with sufficient accuracy for practical 
purposes. We have for the approximate quartile limits, 
.67449 X 4.6364 = 3.1272. This value, added to and sub- 
tracted from the median 14.656, gives the results set down 
in the last line of Table I. 

Exactly the same procedure, with different numbers, is 
followed in the case of the blue column. 

6. Let us now consider a more completely worked out illustration. Some
time ago Mr. Alexander Weinstein, of Columbia University, consulted
the writer in regard to a problem which had arisen in connection with
his Mendelian breeding experiments. A certain type of mating gave the
following class frequencies:

6363 + 579 + 3638 + 1208 + 128 + 115 + 350 + 6 = 12387 . . . (a).

Another group of matings gave a total of 9,017 offspring, of which 30
fell in the x class, this being the only class in regard to which a
comparison is to be made. On certain theoretical grounds the
percentage frequency in this x class in the second sample would be
expected to be 0.582 times the percentage frequency of this same class
in the first sample. The question is whether the actually observed
frequency of 30 in this second sample is such as could reasonably be
expected to occur if the theoretical assumption actually were the
fact.

It will be seen at once that, owing to the very small ab- 
solute frequency of this x class in both samples, ordinary 
probable error methods will be of no avail. 

Approaching the problem by the method here proposed, we have, as basic
values for the computations,

$$
\begin{aligned}
n &= 12387, \\
m &= 9017, \\
p &= 115, \qquad q = 12272, \\
\bar{p} &= p/n = .009284, \\
\bar{q} &= q/n = .990716.
\end{aligned}
$$

Whence we have for the mean in the second sample of 9017 by (i)

$$\text{Mean} = 84.428137,$$

and by (iii)

$$\sigma = 12.020652.$$

By (ii) the mode = 83, whence d = 1.428137 and

$$Sk = \frac{1.428137}{12.020652} = 0.118807.$$
Working directly from the moments of the hypergeometrical series and,
in effect, replacing that series with a true curve, we find

$$\text{Mode} = 83.222141,$$

$$d = 1.205996, \text{ and}$$

$$Sk = 0.100327 \pm .008696.$$

TABLE II

Showing the Successive Ordinates of the Hypergeometrical Series
for the Second Sample

  r     C_r      Sum        r     C_r      Sum        r     C_r      Sum
 35  .000000  .000000      73  .022620  .182838     111  .003215  .983125
 36  .000001  .000001      74  .024355  .207193     112  .002740  .985865
 37  .000001  .000002      75  .025996  .233189     113  .002325  .988190
 38  .000002  .000004      76  .027539  .260728     114  .001964  .990154
 39  .000003  .000007      77  .028943  .289671     115  .001651  .991805
 40  .000005  .000012      78  .030183  .319854     116  .001382  .993187
 41  .000008  .000020      79  .031236  .351090     117  .001152  .994339
 42  .000013  .000033      80  .032085  .383175     118  .000957  .995296
 43  .000020  .000053      81  .032715  .415890     119  .000791  .996087
 44  .000031  .000084      82  .033116  .449006     120  .000651  .996738
 45  .000046  .000130      83  .033285  .482291     121  .000533  .997271
 46  .000068  .000198      84  .033220  .515511     122  .000436  .997707
 47  .000099  .000297      85  .032929  .548440     123  .000354  .998061
 48  .000142  .000439      86  .032419  .580859     124  .000287  .998348
 49  .000200  .000639      87  .031706  .612565     125  .000231  .998579
 50  .000279  .000918      88  .030805  .643370     126  .000186  .998765
 51  .000383  .001301      89  .029738  .673108     127  .000149  .998914
 52  .000520  .001821      90  .028526  .701634     128  .000119  .999033
 53  .000696  .002517      91  .027193  .728827     129  .000094  .999127
 54  .000919  .003436      92  .025763  .754590     130  .000075  .999202
 55  .001199  .004635      93  .024262  .778852     131  .000059  .999261
 56  .001545  .006180      94  .022711  .801563     132  .000046  .999307
 57  .001967  .008147      95  .021136  .822699     133  .000036  .999343
 58  .002476  .010623      96  .019556  .842255     134  .000028  .999371
 59  .003082  .013705      97  .017991  .860246     135  .000022  .999393
 60  .003793  .017498      98  .016459  .876705     136  .000017  .999410
 61  .004617  .022115      99  .014974  .891679     137  .000013  .999423
 62  .005561  .027676     100  .013550  .905229     138  .000010  .999433
 63  .006629  .034305     101  .012195  .917424     139  .000008  .999441
 64  .007821  .042126     102  .010917  .928341     140  .000006  .999447
 65  .009135  .051261     103  .009722  .938063     141  .000005  .999452
 66  .010569  .061830     104  .008614  .946677     142  .000004  .999456
 67  .012109  .073939     105  .007593  .954270     143  .000003  .999459
 68  .013743  .087682     106  .006660  .960930     144  .000002  .999461
 69  .015455  .103137     107  .005813  .966743     145  .000001  .999462
 70  .017223  .120360     108  .005049  .971792     146  .000001  .999463
 71  .019025  .139385     109  .004364  .976156     147  .000001  .999464
 72  .020833  .160218     110  .003754  .979910







The two sets of values are evidently sufficiently near together to be
used interchangeably for most practical purposes, a sort of result
which is familiar to any one who has had any considerable experience
with the method of moments.

To return now to the series we find, using 10-place logarithms in the
intermediate computations and the Forsyth approximation,

$$\log C_0 = 72.3493814 - 100.$$

We have now to calculate the successive terms of the series. If this
were done for the whole range it would involve a literally colossal
amount of labor. Fortunately this is not necessary. We need only take
that part of the range which includes appreciable frequencies. By a
few trials we find that this part of the range begins with r = 36. In
Table II are given the frequencies for the several terms in the series
between r = 36 and r = 147 inclusive, the total area being taken as
unity. To reduce these frequencies to the actual numbers for the
second sample we have only to multiply in every case by 9017. We have
calculated C₃₆ by (vii) and used the Forsyth C₀.

From this table we easily deduce 

Median = 83.5331, 
Lower quartile = 75.6104 (ix) 

Upper quartile = 91.8218. 
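The text does not spell out how the fractional values in (ix) were
read from Table II. One rule that reproduces them, offered here as an
inference and not as the author's stated procedure, is linear
interpolation within the term that carries the running sum past the
required fraction. A minimal sketch, using only the six rows of
Table II that bracket the three points (the names `TABLE_II_ROWS` and
`interpolated_quantile` are hypothetical):

```python
# r -> (C_r, Sum) for the rows of Table II that bracket each quartile
TABLE_II_ROWS = {
    75: (0.025996, 0.233189),
    76: (0.027539, 0.260728),
    83: (0.033285, 0.482291),
    84: (0.033220, 0.515511),
    91: (0.027193, 0.728827),
    92: (0.025763, 0.754590),
}

def interpolated_quantile(rows, target):
    """Locate r with Sum(r) <= target < Sum(r + 1), then interpolate linearly."""
    for r in sorted(rows):
        if r + 1 in rows and rows[r][1] <= target < rows[r + 1][1]:
            next_ordinate = rows[r + 1][0]
            return r + (target - rows[r][1]) / next_ordinate
    raise ValueError("target is not bracketed by the supplied rows")

for label, fraction in (("Lower quartile", 0.25), ("Median", 0.50), ("Upper quartile", 0.75)):
    print(label, round(interpolated_quantile(TABLE_II_ROWS, fraction), 4))
```

Applied to the tabulated sums this rule returns 75.6104, 83.5331 and
91.8218, in agreement with (ix).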

Now, remembering that if the same law holds for the 
second Mendelian distribution as for the first we should 
expect the x class in that distribution to be 0.582 times the 
value of the same class in the first distribution, we have 

Expected mean value of x class in second distribution = 49.14
Expected modal value of x class in second distribution = 48
Expected lower quartile value in second distribution = 44.01
Expected upper quartile value in second distribution = 53.44
The actual experimental value obtained was 30, which is far below the
lower quartile. From Table II we find, remembering again that on a
priori grounds the experimental frequencies are reduced by the factor
0.582, that if the two distributions were really samples of the same
population, obeying the same Mendelian laws, it would be expected that
the x class would show a frequency as low as or lower than 30, only 18
times in 10,000 trials of samples of 9017. Or, in other words, the
odds against so low a value as 30 are about 556 to 1. These are about
the same odds as those associated with the occurrence of a deviation
4.63 times the probable error (cf. Pearl and Miner¹⁴).
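The order of magnitude of this tail chance can be checked by summing
the series directly. The sketch below is an assumption-laden
illustration, not a reconstruction of the original arithmetic: it
evaluates each ordinate through `math.lgamma`, rescales the observed
30 by the factor 0.582 to the scale of Table II, and sums the terms up
to the whole numbers on either side. The helper names `log_ordinate`
and `cumulative` are hypothetical, and the paper does not state
exactly how the figure of 18 in 10,000 was read off.

```python
from math import lgamma, exp

def log_ordinate(r, p, q, m):
    """Natural log of the chance of exactly r occurrences in the second sample."""
    n, s = p + q, m - r
    log_choose = lgamma(m + 1) - lgamma(r + 1) - lgamma(s + 1)
    log_beta_ratio = (lgamma(p + r + 1) + lgamma(q + s + 1) - lgamma(n + m + 2)
                      - lgamma(p + 1) - lgamma(q + 1) + lgamma(n + 2))
    return log_choose + log_beta_ratio

def cumulative(r_max, p, q, m):
    """Chance of r_max or fewer occurrences."""
    return sum(exp(log_ordinate(r, p, q, m)) for r in range(r_max + 1))

P, Q, M, FACTOR = 115, 12272, 9017, 0.582
equivalent = 30 / FACTOR          # observed 30 expressed on the scale of Table II
low = cumulative(51, P, Q, M)     # chance of 51 or fewer
high = cumulative(52, P, Q, M)    # chance of 52 or fewer
print(round(equivalent, 2), low, high, 1 / high)
```

The two sums bracket the figure of roughly 18 in 10,000 quoted above,
and the reciprocal of the larger is of the same order as the stated
odds.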

We may, therefore, conclude with great certainty that 
the value of 30 is significantly smaller than would be ex- 
pected to occur in the x class on the basis of chance (de- 
viation due to random sampling) if the two distributions 
were really samples of the same population. 

Let us now go back and approach the problem de novo by the approximate
method suggested in section 4. We have

$$
\begin{aligned}
d = \text{mean} - \text{mode} &= 1.4281, \\
\frac{1.4281}{3} &= .4760, \\
84.4281 - .4760 &= 83.9521 = \text{median (approx.)}, \\
12.0207 \times .67449 &= 8.1078, \\
83.9521 - 8.1078 &= 75.8443 = \text{Lower quartile}, \\
83.9521 + 8.1078 &= 92.0599 = \text{Upper quartile}.
\end{aligned}
$$

Comparing these values, in the obtaining of which all 
the tremendously tedious and time-consuming arithmetic 
involved in calculating Table II was avoided, with those 
shown in (ix) makes it quite evident that for all practical 
statistical purposes the approximate method would have 
given sufficiently accurate results. 

¹⁴ Pearl, R., and Miner, J. R., "A Table for Estimating the Probable
Significance of Statistical Constants," Me. Agr. Expt. Stat. Ann.
Rept. for 1914, pp. 85-88.
SUMMARY

In this paper is presented a method of calculating and expressing the
errors, due to random sampling, of a Mendelian class frequency. The
method consists essentially in expressing each expected Mendelian
frequency as the probable quartile limits for that class frequency in
a supposed second sample of the same size as the observed sample drawn
from the same population. These quartile limits are determined from
the ordinates of a hypergeometrical series. Various simplifications of
method are suggested and illustrated. The method is suggested as a
supplement to, not as a substitute for, the χ² test for the goodness
of fit in Mendelian distributions.



